Stratified Reservoir Sampling over Heterogeneous Data Streams

نویسندگان

  • Mohammed Al-Kateb
  • Byung Suk Lee
چکیده

Reservoir sampling is a well-known technique for random sampling over data streams. In many streaming applications, however, an input stream may be naturally heterogeneous, i.e., composed of substreams whose statistical properties may also vary considerably. For this class of applications, the conventional reservoir sampling technique does not guarantee a statistically sufficient number of tuples from each substream to be included in the reservoir, and this can cause a damage on the statistical quality of the sample. In this paper, we deal with this heterogeneity problem by stratifying the reservoir sample among the underlying sub-streams. We particularly consider situations in which the stratified reservoir sample is needed to obtain reliable estimates at the level of either the entire data stream or individual sub-streams. The first challenge in this stratification is to achieve an optimal allocation of a fixed-size reservoir to individual sub-streams. The second challenge is to adaptively adjust the allocation as sub-streams appear in, or disappear from, the input stream and as their statistical properties change over time. We present a stratified reservoir sampling algorithm designed to meet these challenges, and demonstrate through experiments the superior sample quality and the adaptivity of the algorithm.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

KSample: Dynamic Sampling Over Unbounded Data Streams

Data sampling over data streams is common practice to allow the analysis of data in real-time. However, sampling over data streams becomes complex when the stream does not fit in memory, and worse yet, when the length of the stream is unknown. A well-known technique for sampling data streams is the Reservoir Sampling. It requires a fixed-size reservoir that corresponds to the resulting sample s...

متن کامل

Exponential Reservoir Sampling for Streaming Language Models

We show how rapidly changing textual streams such as Twitter can be modelled in fixed space. Our approach is based upon a randomised algorithm called Exponential Reservoir Sampling, unexplored by this community until now. Using language models over Twitter and Newswire as a testbed, our experimental results based on perplexity support the intuition that recently observed data generally outweigh...

متن کامل

Prediction of Missing Events in Sensor Data Streams Using Kalman Filters

Sensors and instruments are an important source of real time data. However, sensor networks and instruments and their delivery systems can fail due to intrusion attacks, node failures, link failures, or problems in the measuring instruments. Missing data can cause prediction inaccuracies or problems in the continuous events processing process. Estimation techniques can approximate missing data ...

متن کامل

Unsupervised Instance Selection from Text Streams

Instance selection techniques have received great attention in the literature, since they are very useful to identify a subset of instances (textual documents) that adequately represents the knowledge embedded in the entire text database. Most of the instance selection techniques are supervised, i.e., requires a labeled data set to define, with the help of classifiers, the separation boundaries...

متن کامل

Approximate Integration of streaming data

We approximate analytic queries on streaming data with a weighted reservoir sampling. For a stream of tuples of a Datawarehouse we show how to approximate some Olap queries. For a stream of graph edges from a Social Network, we approximate the communities as the large connected components of the edges in the reservoir. We show that for a model of random graphs which follow a power law degree di...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2010